Spatiotemporal neural network with attention mechanism for El Niño forecasts

To learn spatiotemporal representations and anomaly predictions from geophysical data, we propose STANet, a spatiotemporal neural network with a trainable attention mechanism, and apply it to El Niño predictions for long-lead forecasts. The STANet makes two critical architectural improvements: it learns spatial features globally by expanding the network’s receptive field and encodes long-term sequential features with visual attention using a stateful long-short term memory network. The STANet conducts multitask learning of Nino3.4 index prediction and calendar month classification for predicted indices. In a comparison of the proposed STANet performance with the state-of-the-art model, the accuracy of the 12-month forecast lead correlation coefficient was improved by 5.8% and 13% for Nino3.4 index prediction and corresponding temporal classification, respectively. Furthermore, the spatially attentive regions for the strong El Niño events displayed spatial relationships consistent with the revealed precursor for El Niño occurrence, indicating that the proposed STANet provides good understanding of the spatiotemporal behavior of global sea surface temperature and oceanic heat content for El Niño evolution.


Scientific Reports
| (2022) 12:7204 | https://doi.org/10.1038/s41598-022-10839-z www.nature.com/scientificreports/ SINTEX-F5, a leading dynamic forecasting system. And the 2D CNN provides a skillful forecast of ENSO events up to 1.5 years in advance. Therefore, the unified 2D CNN model 15 , which is the state-of-the-art (SOTA) method, is implemented as the baseline model and is compared with the results of the model proposed in our study. Spatiotemporal forecasting using geophysical data has its own specificity, such as high-dimensional, often limited in scope, and temporally correlated, which should be considered in building deep neural networks. In this study, we propose STANet, a spatiotemporal neural network with a trainable attention mechanism that can learn the characteristics of spatiotemporal geophysical data over time and test its predictability by applying it to El Niño forecasts. In particular, through the effective receptive field and dilated convolution 19 as well as attention mechanism in recurrent neural network 20 , it is possible to obtain greater understanding of the spatiotemporal relationship between global-scale SST and HC anomalies that affect the occurrence of El Niño. In contrast to the 2D CNN, which does not consider any temporal order of given input, the proposed STANet employs the 3D receptive field blocks with 2D + time convolution filters to learn spatiotemporal patterns from the short-term input (three months) and the stateful LSTM module with an attention mechanism to learn the temporal order of the long-term sequences. The receptive field block efficiently increases the receptive field using dilated convolution layers and residual skip-connection, and it allows the model to learn spatiotemporal features from larger regions. The stateful LSTM preserves long-term contextual information across sequential input data by keeping its hidden state and cell state even after mini-batch inference. In addition, the attention mechanism allows the proposed model to process the geophysical data and focus on more relevant regions at specific time steps precisely in the forecasting task. To examine the spatiotemporal feature representation capability according to the architectural characteristics of STANet, ablation experiments are performed through the combination of individual network modules of the proposed STANet for the El Niño prediction task. In addition, to evaluate how well the STANet understand the spatiotemporal behavior of SST and HC for El Niño prediction, we will look into whether the trained attention map and the precursor for El Niño occurrence are consistent.

Results
Overall predictability. Figures 1 and 2 show the time series of the correlation coefficient (CC), root-meansquare-error (RMSE), and density scatter of the calendar month classification to evaluate the performance of the proposed STANet and the SOTA model (2D CNN) in El Niño prediction for the 23-month forecast lead time from 1980 to 2017. The red and black lines represent the STANet and 2D CNN model, respectively. The overall El Niño predictability of the proposed STANet was improved for the 12-month forecast lead time with the CC improving by 5.8% and the RMSE decreasing by 0.07 on average compared to the SOTA model. Classification accuracy was also improved by 13% during the entire test period (see Figs. 2, 7).
Spring predictability barrier. The spring predictability barrier, a unique feature of ENSO 21 means that ENSO forecasting accuracy decrease dramatically during boreal spring when an ENSO event begins to grow 8,22 . To investigate the predictability for forecasts initiated in boreal spring, the forecasts with input anomalies dur-

Time-series prediction.
To test the spatiotemporal feature representation capability of the STANet based on its architectural characteristics, ablation experiments were carried out using the proposed STANet's network modules for the El Niño prediction task. In addition, to examine how well the STANet understands the spati-   www.nature.com/scientificreports/ otemporal behavior of SST and HC for El Niño prediction, we will look into the consistency of the trained attention map and precursor for El Niño occurrence.
The following Fig. 4 shows the Nino3.4 time series over the test period of 1996-2016 to examine how much the predictability of strong El Niño events was improved in terms of Nino3.4 indices. In addition, Fig. 5 shows the evolution of the Oceanic Nino Index during the two strong El Niño events of 1997/1998 and 2015/2016 through the time series of the Nino3.4 index, respectively. In Figs. 4 and 5, the red line represents the predicted time series of the proposed STANet, and the black, blue, and turquoise lines represent OISST v.2 observations, GODAS test data, and the predicted time series of the SOTA 2D CNN model, respectively. From top to bottom, time series with forecast lead times of 1, 3, 6, and 12 months are shown. As the forecast lead time increases, the peak value is underestimated, and it is difficult to expect predictability concerning the 12 forecast lead month.  Fig. 6 of trained attention maps that acted as precursors to strong ENSO events, shaded backgrounds and contours are spatial distributions for SST and HC anomalies several months prior to the event occurrence (see each figure title) used as inputs for El Niño and La Niña predictions, and the legends of SST and HC anomalies are displayed on the right and overlaid contours of individual figures, respectively. And the white and blue squares show the region of Nino3.4, and the red gradient on the right panels shows the region where the proposed STANet is most affected by the input data given for Nino3.4 prediction.
A spatial region of attention is selected from among the whole area of spatial data, and values in this region are further processed with a higher weight than the rest of the spatial data. As a result, the attention map in which a specific area is highlighted in red indicates the area that had the most influence on the ENSO event. The attention maps clearly show where the SST and oceanic HC are important for a future ENSO evolution. The   24,25 .
In addition, the northern hemisphere off-equatorial SST is an important precursor of El Niño development 26 , which seems to be captured in the attention map. Interestingly, the attention map for the 2009 El Niño is somewhat different from the other two events. That is, the attention map located to some extent in the central Pacific, and there is a clear signal in the northern tropical Atlantic Ocean, which may be related to the fact that the 2009 El Niño is a central Pacific-type El Niño 27 .
The attention map for the La Niña events also tends to capture the equatorial Pacific signals according to the recharge oscillator 24,25 . However, for the 2010 La Niña events, the eastern Pacific and Atlantic warm pools are attentive regions, which may be related to the connection with the Atlantic warm pool 28 .
The trained attention in this way shows spatial relationships consistent with the revealed El Niño dynamics, indicating that the proposed STANet provides a good understanding of the spatiotemporal behavior of the global SST and oceanic HC for El Niño prediction. In other words, the proposed STANet can determine which area of the input SST and oceanic HC anomalies to focus on for El Niño forecasting.  Table 2). Each figure shows the time series of CC and RMSE on Nino3.4 index prediction for the 23-month forecast lead time and the hit accuracy of the calendar month classification for the test period of 1980-2017. In comparison to the proposed spatiotemporal neural network with attention, "STANet", each result of blue, black, and green solid lines and bar graphs shows the cast-off using GRU instead of stateful LSTM, not using the attention mechanism, and not using RFB for encoding globally spatial information, respectively.

Ablation experiments.
When the results of the ablation experiments were compared comprehensively using a combination of individual network modules, spatiotemporal feature representation with extended receptive field (RFB) provided the greatest performance improvement on the El Niño forecasts. It means that the prediction of El Niño is affected by a larger area including its Nino3.4 region, and a long time of sequence signals. Although temporal feature learning with GRU and spatial feature learning with attention mechanisms both improve performance, it is clear that their effects are maximized when the area and sequence of spatiotemporal feature learning are expanded simultaneously. "STANet" is the proposed architecture. "GRU" is an alternative model using GRU module 29 instead of the LSTM module. "no Attention" indicates the STANet without attention mechanism. "no RFB" indicates the STANet using 3D convolution layers instead of the receptive field blocks. Lower RMSE and higher CC indicate better performance in Nino3.4 index prediction. "GRU" is an alternative model using GRU module 29 instead of the LSTM module. "no Attention" indicates the STANet without attention mechanism. "no RFB" indicates the STANet using 3D convolution layers instead of the receptive field blocks.

Discussion
The STANet, spatiotemporal neural network with attention, was proposed in this study for predicting extreme events by learning spatiotemporal feature representations from geophysical data, and its performance was evaluated by applying it to El Niño forecasts. The architecture of the STANet has a wider receptive field with trainable attention for globally extended spatial feature encoding and more long-term memory for temporal sequential feature encoding. When compared to the SOTA model, the proposed STANet for the multitasks of Nino3.4 index time series prediction on 12-month forecast lead time and the corresponding calendar month classification for El Niño forecasts in the test period of 1980-2017 improved the performance by 5.8% and 13%, respectively.
In addition, it showed a 10% improvement in performance for 9-month forecast lead forecasts compared to the SOTA model in the "spring predictability barrier" experiment. Furthermore, when the predicted time series of the Nino3.4 index for strong El Niño, an extreme event that occurred during the test period, was examined, it was confirmed that the peak value was predicted more accurately than the SOTA model.
Trained attention map allows interpreting the precursors for El Niño and La Niña occurrences for extreme ENSO events through the visualization of the trained attention map, which has the ability to figure out which of the spatial region to focus on. The attention map displays the representative El Niño precursor in the spring season, where the HC in the equatorial region and SST in the North Pacific are known as the Pacific Meridional Mode pattern. As a representative of the Central Pacific El Niño event in 2009, the attention map captures the precursor, North tropical Atlantic (NTA) SST signal. In general, La Niña has more diverse precursors than El Niño. La Niña could be linked to a wide range of developmental patterns. This may be related to a wide variety of La Niña developmental patterns. Overall, the result is also in good agreement with the climate dynamics for ENSO evolution.
In comparison to previous deep learning-based El Niño forecasting models that discovered patterns and trends buried within the data, it is believed that the proposed STANet understands the spatiotemporal behavior of global SST and HC as well as the nonlinear spatiotemporal relationship related to the mechanism of ENSO occurrence. There are also previous studies that showed good skills in the statistical model developed for medium-and long-range ENSO prediction 30,31 with a correlation coefficient of 0.5 or more in 12-month and 24-month ahead prediction, respectively. While the proposed STANet was trained with historical simulation data of CMIP5, the above two statistical models were fitted with the National Centers for Environmental Prediction/National Center 40-year Reanalysis 32 data and OISST v2 observation data, respectively. Therefore, it is difficult to directly compare the prediction performance of each ENSO forecasting, but it seems that it will be necessary to make an effort to improve the long-range prediction accuracy of deep learning approaches as well.
In Fig. 5, both neural networks underestimate large changes in Nino3.4 after 12 months. The uncertainties of the climate parameters and the neural networks are a possible explanation on the difficulty of the longer lead-time prediction. The uncertainties increase in the variance of the prediction 33 . Ensemble approaches using multiple models are promising in reducing the uncertainties for the Nino3.4 prediction 12 . Another reason for the limited performance is the lack of training data set enough for the neural networks. Due to the nature of the spatial-temporal climate data, conventional image augmentation methods, such as flipping, rotation, and translation, cannot be used for training the neural networks for the Nino3.4 index prediction. Generation of climate data using generative models, such as generative adversarial networks, is a possible option for the issue.
Time series extreme event forecasting in Earth science research is difficult task because it is rare or invisible in the training data gathered from historical observations and it is dependent on numerous external factors that can include complex weather-ocean interaction, not only in the top-down approach of the physics law-based numerical model, but also in the data-driven bottom-up approach, machine learning or deep learning method.
Deep learning, on the other hand, has the advantage of fast and cheap inference or prediction on new data after it has been trained. Speed and economy offer a distinct advantage over many physical models in Earth science, which must be inversely solved and which require significant time and computational resources for each application. Furthermore, recent advances in deep learning techniques such as adversarial generative networks, parameterization of phenomena and emulation, and physics-aware machine learning, are yielding very promising results in estimating atmospheric convection, SST and vegetation dynamic modeling 34 .
Therefore, as a future study, we intend to develop a network architecture allowing physical constraint based on physical laws and secure applicability to predict extreme weather, climate, and ocean phenomena. Above all, we are developing an explainable architecture that can quantitatively explain the rationale for the inference result to increase the trustworthiness of the deep learning method's use.

Methods
The STANet performs two tasks simultaneously with multi-input and multi-output time series prediction for El Niño forecasts based on given 3-month sequences of SST and HC anomalies: (1) SST anomaly prediction over the Nino3.4 region (Nino3.4 index) for a 23-month forecast lead time and (2) calendar month classification for the given input and predicted output. The reason why SST and HC are used as inputs is that it is well known that ocean HC resides a memory for the future ENSO evolution. In addition, the SST pattern is important because it modulates equatorial wind variability, which is a key to the ENSO evolution 7,35,36 .
For sequential SST anomaly prediction from t + 1 to t + 23 , the STANet has a novel architecture based on a stateful LSTM with a spatiotemporal attention mechanism. During the learning process for SST anomaly prediction, the calendar month classification assists the STANet learning distinguishable features between seasonal inputs. For long-sequence Nino3.4 indices prediction, we also employed a correlation loss, smooth L1 loss, and cross-entropy loss in model training. In the sections that follow, we will first introduce the STANet architecture and will then go over the implementation details of its submodules for multitask learning.  Figure 9 presents the entire architecture of the STANet. The first module of the STANet is a spatiotemporal encoding module that uses three-dimensional (3D) convolution layers with varying filter sizes, dilation factors, and residual connections to capture relevant features of given inputs in spatial and temporal domains. The second module is a stateful LSTM with a spatial attention mechanism to predict the 23-month SST anomaly values for long input sequences from the latent features of the encoding module. The final module is a classification module with two fully connected layers that are followed by hyperbolic tangent and softmax activation functions. The input of the STANet consists of a 3-month consecutive input sequence of global SST and HC anomaly maps that are concatenated along feature dimensions. As a result, the input is a T × G lon × G lat × W × F init tensor where G lon denotes the size of the longitude grids and G lat denotes the size of the latitude grids. W is 3 as the observation window size of a given input. F init is the number of features, which is 2 for the SST and HC anomaly maps. T indicates the length of training or prediction period. In our approach, we used the stateful LSTM to give the network sequential inputs for the entire prediction period at once to understand long-term dependencies along time steps. For training, T was 32 consecutive months, which was determined through a grid search 37 . The STANet provides two outputs: consecutive 23 Nino3.4 indices for 23-month forecast lead time and a calendar month for given input sequences. Thus, the predicted Nino3.4 indices output is represented as a T × L matrix, where L is the total forecast lead time (23 months), and the classification output is a T × 12 matrix. The calendar month for each input is determined as the month of the maximum probability. Figure 9 shows the prediction process of the STANet for the prediction period.
Spatiotemporal encoding module with receptive field blocks. To extract the salient features of the given SST and HC anomaly maps for multitask learning, a 3D convolutional sub-network, named spatiotemporal encoding module, takes the first step of the STANet. In particular, rather than connecting several simple convolution layers as suggested in 12,15 , a receptive field block 38 (RFB) is introduced into the module to learn spatial features globally by extending the receptive field of the network efficiently. Figure 10 shows the detailed structure of the spatiotemporal encoding module and RFB. Figure 9. Architecture of the proposed STANet. "FCs" means two sequential fully-connected layers with ReLU and softmax functions respectively for the calendar month classification. "RFB Block" is a receptive field block with 3D convolution layers. "LSTM" is the stateful long short-term memory module. www.nature.com/scientificreports/ As inspired by the receptive field of the human visual system, an individual RFB includes multi-branch convolutional pooling with varying kernels corresponding to the receptive field of different sizes. Because the STANet receives inputs in 3D space (longitude, latitude, and time), we implement a 3D RFB with 3D convolutional layers. In the RFB, dilated convolutional layers with various dilatation rates from 1 to 7 are assigned to each branch. Because the observation window (W) is only 3, we consistently apply dilation rate 1 for the temporal dimension. The final representation of the latent features is then produced by concatenation following 1 × 1 × 1 convolution layers in all branch (see Fig. 10). All convolution layers in the RFB are followed by a ReLU activation function. Each RFB block is followed by max-pooling with a kernel size of 2 × 2 × 1 and strides (2, 2, 1) to downsample the feature maps. The STANet encodes the spatiotemporal features of SST and HC anomaly inputs, denoted as z t , by sequentially stacking three RFBs with max-pooling and applying a flatten operation for the spatial and temporal dimensions to be given to the stateful LSTM module and classification module. z t is represented in E × F enc dimension, and for the entire prediction period, the STANet provides T spatiotemporal feature vectors. Here, F enc indicates the number of spatiotemporal features.
Stateful LSTM module with attention mechanism. We implemented a stateful LSTM sub-network (stateful LSTM module) inspired by a video captioning LSTM network with visual attention 39 to predict the Nino3.4 indices of next L months for the prediction period (T) from the spatiotemporal features. At every time step during T, this module predicts L-dimensional vectors of Nino3.4 index predictions conditioned on a context vector and the previous hidden state. The operations of the LSTM cell are defined as follows: Figure 10. Structure of the receptive field block (RFB). "Conv" denotes convolution layer followed by ReLU activation function. "Rate" is the dilation rate of defining a spacing between the values in 3D kernels. The RFB includes a residual skip-connection to train deeper networks. where z t,i is a F enc -dimension vector, i = 1, · · · , E , of z t at different spatial locations and observation time ( 1, · · · , W). In Eq. (1), ẑ t is the context vector which is the attentive context representation of z t by an attention mechanism 39,40 . The attention mechanism, denoted as φ , computes ẑ t as follows: α t,i is weight for each z t,i , determined by an simple attention model ( f att , see Fig. 11), which is a multi-layer perceptron conditioned on the previous hidden state ( h t−1 ).

Scientific Reports
Here, β t is a gating scalar which is obtained by a fully-connected layer with sigmoid function from the previous hidden state ( h t−1 ). This attention mechanism encourages the STANet to focus on a relevant spatial location and observation time for Nino3.4 index prediction while understanding the contextual changes as well as the input sequences for the prediction period. Figure 11 shows the entire structure of the attention mechanism.
In the practical implementation of the STANet, T consecutive input sequences are given to a network as a minibatch. To learn the characteristic patterns of consecutive inputs, the STANet processes them sequentially without random indexing separately and saves the LSTM network states.
Losses for STANet. The STANet is trained in end-to-end manner with three losses: (1) smooth L1 loss, (2) Pearson correlation loss, and (3) cross-entropy loss. The first two losses are employed for sequential SST anomaly prediction in the prediction period. The last loss is for calendar month classification. The smooth L1 loss assists the STANet minimizing the differences between the predicted Nino3.4 index and ground truth: α t,i z t,i Figure 11. Structure of the trainable attention Module. "FC" and "Broadcasting" denote a fully-connected layer and implicit copying of vector, respectively. where y l,t is an observed SST anomaly at l-th month from the input month (t) and p l,t is the predicted SST anomaly of the corresponding month. The smooth L1 loss is less sensitive to outliers than the mean squared error loss, and it helps the network learn the dominant trends of SST anomaly along time. The Pearson correlation loss is used to increase the correlation between Y t and P t which are 23-dimension vector for observing and predicting SST anomalies, respectively. The correlation loss is defined as: The cross-entropy loss is for the calendar month classification for each input: where s i,t is 1 for the corresponding calendar month of each input and 0 for others. ŝ i,t is the output of the softmax function. The STANet is trained using the weighted sum of all losses: where α is 0.7, β is 0.2, and γ is 0.2 in this study.

Experiments
Data. Three kinds of datasets were used: historical simulation, reanalysis, and observation data as described in Table 1 Although the data periods are overlapped between SODA/GODAS and CMIP5, there is no dependency between the two datasets. In CMIP5 simulation, only long-term radiative forcing such as greenhouse gases and aerosols are prescribed, and it does not affect individual El Niño events, which have an interannual time scale. For example, the observed ENSO index and the simulated one show almost no correlation, suggesting they are completely independent. Furthermore, as the observation data for the Nino3.4 index, Optimum Interpolation SST version 2 (OISST v2) of the period 1982-2017 is used to compare the predictions 43 .
Model implementation. Two input variables of SST and HC anomalies with a length of input sequence of 3 consecutive months were used to train a network to predict the Nino3.4 index of forecast lead time ranging from 1 to 23 months from 21 CMIP5 datasets. We used the SODA data as a tuning set to find optimal parameters using early stopping. Using the GODAS datasets from 1980 to 2017, the proposed network's performance in Nino3.4 index prediction and calendar month classification was evaluated.
The parameters of the proposed network were randomly initialized using orthogonal initialization 44,45 . The network is trained using the Adam optimizer with learning rate ( η ) 0.001, β 1 0.9, and β 2 0.999. For the data augmentation, we added random Gaussian noise with γ 0.2 to the SST and HC anomaly input maps. Model training and all experiments were conducted on a workstation with NVidia TITAN XP (12 GB memory), Intel i9 CPU, and 48 GB main memory. The batch size was 32 for the training. The proposed network and algorithms were implemented using Python 3.6.9 and PyTorch 1.8.1.  Table 2. First, we constructed three models without the STANet's key components: RFB blocks for expanding the receptive field, an attention mechanism for focusing on relevant spatiotemporal features, and an LSTM module for learning the temporal patterns of input sequences for the prediction period. For the first model, we only replaced the RFB blocks with simple 3D (longitude, latitude, and time) convolution layers with the same activation functions as the RFB blocks. The second model was implemented only by removing the attention module from the STANet.
We built the last model by changing the LSTM module to a gated recurrent unit (GRU) 29 . The GRU has updated and reset gates to solve the vanilla recurrent neural network's vanishing gradient problem. Because the GRU outperformed the LSTM in learning long-term sequences on smaller and less frequent datasets, we chose it for the ablation study to validate the efficacy of the proposed method with the LSTM in learning long-term sequences. For a comparison study with a state-of-the-art model, we built a 2D CNN model 15 using the authors' Tensor-Flow codes, available at https:// github. com/ jeong hwan7 23/A_ CNN. The 2D CNN consists of three 2D convolution layers, followed by hyperbolic tangent (tanh) activation function and max pooling layer, two fully-connected layers with tanh function, and two separate fully-connected layers for Nino3.4 index prediction and calendar month classification. For the calendar month classification, the 2D CNN uses softmax function. The 2D CNN was trained to reduce mean squared error loss and cross-entropy loss for the Nino3.4 index prediction and the calendar month classification, respectively. In this study, the 2D CNN was trained using the same data as the same conditions for comparative evaluation with the proposed STANet.  License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.